Coerce types on read #76
Conversation
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
##             main      #76      +/-   ##
==========================================
- Coverage   92.83%   91.94%   -0.89%
==========================================
  Files          62       62
  Lines        7645     8795    +1150
==========================================
+ Hits         7097     8087     +990
- Misses        548      708     +160

☔ View full report in Codecov by Sentry.
`COPY FROM parquet` is too strict when matching the Postgres tupledesc schema to the parquet file schema, e.g. an `INT32` type in the parquet schema cannot be read into a Postgres column with `int64` type. We can avoid this situation by casting the arrow array to the array type expected by the tupledesc schema, if the cast is possible. We can make use of the `arrow-cast` crate, which is in the same project as `arrow`. Its public API lets us check whether a cast is possible between 2 arrow types and then perform the cast.

To make sure the cast is possible, we need to do 2 checks:
1. arrow-cast allows the cast from "arrow type at the parquet file" to "arrow type at the schema that is generated for tupledesc",
2. the cast is meaningful at Postgres. We check if there is an explicit cast from "the Postgres type that corresponds to the arrow type at the parquet file" to "the Postgres type at the tupledesc".

With that we can cast between many castable types as shown below:
- INT16 => INT32
- UINT32 => INT64
- FLOAT32 => FLOAT64
- LargeUtf8 => UTF8
- LargeBinary => Binary
- Struct, Array, and Map with castable fields, e.g. [UINT16] => [INT64] or struct {'x': UINT16} => struct {'x': INT64}

**NOTE**: Struct fields must match by name and position to be cast.

Closes #67.
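A minimal Rust sketch of the check-then-cast idea described above. The helper names (`can_coerce`, `coerce`) and the flag for the Postgres-side explicit-cast check are assumptions for illustration; only `arrow_cast::can_cast_types` and `arrow_cast::cast` are the real crate API:

```rust
use arrow::array::{Array, ArrayRef};
use arrow::datatypes::DataType;
use arrow_cast::{can_cast_types, cast};

// Hypothetical helper: decide whether the arrow array read from the parquet file
// can be coerced to the arrow type expected by the tupledesc schema.
fn can_coerce(from: &DataType, to: &DataType, pg_explicit_cast_exists: bool) -> bool {
    // check 1: arrow-cast supports the conversion between the two arrow types
    // check 2: Postgres has an explicit cast between the corresponding Postgres
    //          types (looked up elsewhere, e.g. in pg_cast, and passed in here)
    can_cast_types(from, to) && pg_explicit_cast_exists
}

// Hypothetical helper: perform the coercion right after reading the arrow array.
fn coerce(array: &ArrayRef, to: &DataType) -> ArrayRef {
    cast(array, to).expect("cast was validated during schema matching")
}

fn main() {
    use std::sync::Arc;
    use arrow::array::Int16Array;

    // e.g. INT16 in the parquet file, INT32 expected by the tupledesc schema
    let parquet_side: ArrayRef = Arc::new(Int16Array::from(vec![1i16, 2, 3]));
    if can_coerce(&DataType::Int16, &DataType::Int32, true) {
        let coerced = coerce(&parquet_side, &DataType::Int32);
        assert_eq!(coerced.data_type(), &DataType::Int32);
    }
}
```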
src/arrow_parquet/schema_parser.rs
Outdated
pgrx::PgLogLevel::ERROR,
PgSqlErrorCode::ERRCODE_CANNOT_COERCE,
type_mismatch_message,
"Try COPY FROM '..' WITH (cast_mode = 'relaxed') to allow lossy casts with runtime checks."
Should be `(cast_mode 'relaxed')`, so drop the `=`.
I have been testing this PR for a while, and everything just works flawlessly.
As we discussed internally, some `CREATE CAST`s may break, such as the following, so it would be nice to document that somehow:
CREATE FUNCTION float_to_text(float) RETURNS text AS $$
BEGIN
RETURN $1::text;
END;
$$ LANGUAGE plpgsql;
CREATE CAST (float AS text)
WITH FUNCTION float_to_text(float)
AS ASSIGNMENT;
COPY (SELECT 123.456::float as a UNION SELECT 789.01) TO '/tmp/float.parquet';
CREATE TABLE parquet_floats (a text);
-- Relaxed mode: uses the custom cast
COPY parquet_floats FROM '/tmp/float.parquet' WITH (cast_mode 'relaxed');
-- Strict mode: the behavior is consistent, but strictness may depend on server-side settings
COPY parquet_floats FROM '/tmp/float.parquet' WITH (cast_mode 'strict');
ERROR: type mismatch for column "a" between table and parquet file.
table has "Utf8"
parquet file has "Float64"
DETAIL: Try COPY FROM '..' WITH (cast_mode = 'relaxed') to allow lossy casts with runtime checks.
README.md
Outdated
- `compression_level <int>`: the compression level to use while writing Parquet files. Compression levels are only supported for the `gzip`, `zstd` and `brotli` compression formats. The default compression level is `6` for `gzip (0-10)`, `1` for `zstd (1-22)` and `1` for `brotli (0-11)`.

`pg_parquet` supports the following options in the `COPY FROM` command:
- `format parquet`: you need to specify this option to read or write Parquet files whose names do not end with the `.parquet[.<compression>]` extension,
- `cast_mode <string>`: specifies the casting behavior, which can be set to either `strict` or `relaxed`. This determines whether lossy conversions are allowed. By default, the mode is `strict`, which does not permit lossy conversions (e.g., `bigint => int` causes a schema mismatch error during schema validation). When set to `relaxed`, lossy conversions are allowed, and errors are only raised at runtime if a value cannot be properly converted. This option provides flexibility to handle schema mismatches by deferring error checks to runtime.
> does not permit lossy conversions (e.g., `bigint => int`)

why is that a lossy conversion? Does it overflow?
When `cast_mode = strict`, it is not allowed at schema comparison time.
When `cast_mode = relaxed`, it is allowed but might fail at runtime if an overflow occurs. pg throws an error, e.g.
pg_parquet=# select 123123123123::bigint::int;
ERROR: integer out of range
Maybe the word "lossy" is confusing. If the error were not thrown, it would be lossy, but pg throws the error.
Got rid of `cast_mode` and now we always cast in the relaxed way.
It could be nice to relax it further through coercion via IO, but lots of nice improvements here.
)
.with_metadata(key_field.metadata().clone());

let value_nullable = true;
seems like it's missing some code coverage
Yeah, that's because `crunchy_map` is missing at CI; we simply skip these tests.
@@ -340,6 +354,14 @@ mod tests {
    Spi::get_one(&query).unwrap().unwrap()
}

fn write_record_batch_to_parquet(schema: SchemaRef, record_batch: RecordBatch) {
    let file = File::create("/tmp/test.parquet").unwrap();
would be good to use a filename that's less likely to conflict with random stuff users do
will be handled at #78
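For illustration, a minimal sketch of how such a test helper could write the record batch with a less collision-prone file name. The process-id suffix is only an assumption, not the fix that actually lands in #78:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

// Hypothetical helper: write a record batch to a test-specific parquet file.
// Including the process id in the file name reduces the chance of clobbering
// unrelated files in /tmp.
fn write_record_batch_to_parquet(schema: SchemaRef, record_batch: RecordBatch) {
    let path = format!("/tmp/pg_parquet_test_{}.parquet", std::process::id());
    let file = File::create(path).unwrap();

    let mut writer = ArrowWriter::try_new(file, schema, None).unwrap();
    writer.write(&record_batch).unwrap();
    writer.close().unwrap();
}

fn main() {
    // Tiny single-column batch just to exercise the helper.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )
    .unwrap();

    write_record_batch_to_parquet(schema, batch);
}
```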
`COPY FROM parquet` is too strict when matching the parquet file schema to the Postgres tupledesc schema, e.g. an `INT32` type in the parquet schema cannot be read into a Postgres column with `int64` type. We can avoid this situation by casting the arrow array to the array that is expected by the tupledesc schema, if the cast is possible. We can make use of the `arrow-cast` crate, which is in the same project as `arrow`. Its public API lets us check whether a cast is possible between 2 arrow types and perform the cast.

To make sure the cast is possible, we need to do 2 checks:
1. arrow-cast allows the cast from "arrow type at the parquet file" to "arrow type at the schema that is generated for tupledesc" (user-created custom cast functions at Postgres won't work via arrow-cast),
2. the cast is meaningful at Postgres. We check if there is an explicit cast from "the Postgres type that corresponds to the arrow type at the parquet file" to "the Postgres type at the tupledesc".

With that we can implicitly cast between many types as shown below:
- INT16 => INT32
- UINT32 => INT64
- FLOAT32 => FLOAT64
- LargeUtf8 => UTF8
- LargeBinary => Binary
- Struct, Array, and Map with castable fields, e.g. [UINT16] => [INT64] or struct {'x': UINT16} => struct {'x': INT64}

NOTE: Struct fields must always strictly match by name and position.

We can also cast between types where the conversion may fail at runtime, e.g. with a value overflow (such as `bigint => int`).

Closes #67.
Closes #79.